Skip to content

SFT: Truncate during dataset preparation, not collation#6155

Open
qgallouedec wants to merge 11 commits into
mainfrom
truncate-during-dataset-preparation
Open

SFT: Truncate during dataset preparation, not collation#6155
qgallouedec wants to merge 11 commits into
mainfrom
truncate-during-dataset-preparation

Conversation

@qgallouedec

@qgallouedec qgallouedec commented Jun 23, 2026

Copy link
Copy Markdown
Member

Move sequence truncation in SFTTrainer from the data collator into _prepare_dataset.

Why

Truncation is a pure per-example slice. Doing it once during (cached) dataset preparation is cleaner than recomputing it in the collator on every batch, and keeps all dataset shaping (tokenize → build labels → truncate → pack) in one place.

It will also allow to drop rows with no trainable rewards, see #6025

Changes

  • _prepare_dataset now truncates input_ids and labels to max_length (respecting truncation_mode), right after labels are built. Skipped when packing (packing already chunks to max_length).
  • build_labels now drops the completion_mask / assistant_masks columns: they're fully baked into labels.
  • DataCollatorForLanguageModeling no longer truncates: the max_length and truncation_mode arguments are removed.

⚠️ Behavior change

With skip_prepare_dataset=True, preparation (and therefore truncation) is skipped. The dataset must already be truncated.


Note

Medium Risk
Changes default training data shape and breaks callers that relied on collator truncation or skip_prepare_dataset without pre-truncated data; core SFT path behavior shifts for long sequences and masked labels.

Overview
SFT sequence truncation moves from DataCollatorForLanguageModeling into _prepare_dataset, so input_ids and labels are sliced to max_length (with truncation_mode) once during cached preparation—after label building, skipped when packing.

DataCollatorForLanguageModeling no longer accepts max_length or truncation_mode; it only pads batches. Label building during preparation now drops assistant_masks / completion_mask after folding them into labels.

Behavior change: with skip_prepare_dataset=True, truncation is not applied; datasets must already be length-limited. Assistant-only truncation regression (#3927) is asserted on the prepared dataset, not via the collator.

Tests drop collator truncation cases and add preparation-focused coverage (test_dataset_truncated_to_max_length, updated #3927 expectations).

Reviewed by Cursor Bugbot for commit b3bbb4f. Bugbot is set up for automated code reviews on this repo. Configure here.

@bot-ci-comment

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0579163751

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread trl/trainer/sft_trainer.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant